Add support for structured parameters #47
Conversation
@klueska: GitHub didn't allow me to request PR reviews from the following users: asm582. Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.
Force-pushed: ab745f7 to f848603
README.md (Outdated)
```
@@ -117,11 +136,48 @@ Spec:
...
**Note**: When running with `StructuredParameters` you can also check that the
```
Does it make sense to publish `NodeAllocationState` when running with structured parameters? That's not going to get updated, is it?
Hmm, it did get updated. But how? Is that something that the kubelet plugin does?
I can imagine that it was easy to do this because the code was already there, but as an example of how to write a DRA driver I find it confusing.
I tried to make the minimal changes necessary to support both classic DRA and structured parameters in the same code base. It would definitely look different if this weren't a goal. In fact, I would probably move all the state of already-prepared claims out of the CRD and instead checkpoint it in a file on the node. Is it worth making this separation in this PR, i.e. two separate plugins / controllers that get deployed depending on the `withStructuredParameters` flag?
It would make the code more realistic for folks who only want to do structured parameters. I think it's worthwhile.
This has now been updated to strip out support for classic DRA.
```go
klog.Infof("Created ResourceClaimParameters for GpuClaimParameters %s/%s", namespace, gpuClaimParameters.Name)
return nil
}
```
This is a non-trivial amount of code. Would it be worth adding unit tests or do we leave that as an exercise to the reader?
The demo works.
It's always worth it to add unit tests :)
I'm going to skip unit tests for now since this goes away in 1.31 anyway.
Next, deploy four example apps that demonstrate how `ResourceClaim`s,
`ResourceClaimTemplate`s, and custom `ClaimParameter` objects can be used to
request access to resources in various ways:
```diff
-kubectl apply --filename=gpu-test{1,2,3,4}.yaml
+kubectl apply --filename=demo/gpu-test{1,2,3,4}.yaml
```
Showing that parameters got converted when using structured parameters would be useful here. OTOH, it's not very impressive right now. So perhaps not (yet)?
Yes, it's not very interesting yet ... I will definitely add this once I open a PR introducing selection via a vendor-specific CRD.
I will wait to do this as part of the 1.31 update
Signed-off-by: Kevin Klues <[email protected]>
Force-pushed: f848603 to 1d1e362
Force-pushed: 19178d5 to b6b1206
Force-pushed: b6b1206 to 1e3c484
```diff
 }

 func (d *driver) NodePrepareResources(ctx context.Context, req *drapbv1.NodePrepareResourcesRequest) (*drapbv1.NodePrepareResourcesResponse, error) {
-	logger := klog.FromContext(ctx)
-	logger.Info("NodePrepareResource", "numClaims", len(req.Claims))
+	klog.Infof("NodePrepareResource is called: number of claims: %d", len(req.Claims))
```
Just wondering, why do you prefer using klog directly over the context logger?
```go
		return nil
	})

	prepared, err := d.state.Prepare(claim.Uid, allocated)
```
It looks like the claim preparation logic is implemented in `state`. I expected most of it here, in `driver.go`.
What does that leave for `state` to do then? Are you suggesting I get rid of the `state` indirection altogether?
It would leave at least checkpointing and probably other state-related operations. The actual device preparation should be decoupled from the state management in my opinion.
I'm not sure how I feel about pulling the list of allocatable devices up into the driver -- this is the main thing that the state abstraction maintains, and "preparation" at this level is basically checking that the allocated devices are actually in the allocatable list, checkpointing them, and then returning the set of CDI devices associated with them. Maybe you just don't like the name `prepare()` for this lower-level operation? Do you have another suggestion for the name?
I had the impression that all resource preparation code was in `state.prepare`. If that's not the case, then I'm OK with this.

Looks good to me except for a couple of nits I've mentioned above. One thing I've noticed is that device preparation is handled by the driver state. I'd suggest moving it out of there into the driver. I understand that for the demo driver it can be empty, but it shouldn't be in the state implementation in my opinion.
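The separation being suggested could be sketched roughly as follows: the state object keeps pure bookkeeping (the allocatable list, checkpointed claims), while the driver owns the actual device preparation and delegates recording to the state. This is a hedged sketch, not the PR's actual code; all names (`Record`, `prepare`, the map layouts) are hypothetical.

```go
package main

import "fmt"

// CDIDevice stands in for a fully qualified CDI device name.
type CDIDevice string

// DeviceState: pure bookkeeping, no device manipulation.
type DeviceState struct {
	allocatable map[string]CDIDevice   // device name -> CDI device
	prepared    map[string][]CDIDevice // claim UID -> prepared devices
}

// Record validates the allocation against the allocatable list and
// checkpoints it; it does NOT touch the devices themselves.
func (s *DeviceState) Record(claimUID string, devices []string) ([]CDIDevice, error) {
	var cdi []CDIDevice
	for _, d := range devices {
		c, ok := s.allocatable[d]
		if !ok {
			return nil, fmt.Errorf("device %q is not allocatable", d)
		}
		cdi = append(cdi, c)
	}
	s.prepared[claimUID] = cdi
	// ... persist the checkpoint to disk here ...
	return cdi, nil
}

// driver: owns preparation, delegates bookkeeping to the state.
type driver struct {
	state *DeviceState
}

func (d *driver) prepare(claimUID string, devices []string) ([]CDIDevice, error) {
	// Device-specific preparation would happen here (empty for the
	// example driver), then the result is recorded in the state.
	return d.state.Record(claimUID, devices)
}

func main() {
	s := &DeviceState{
		allocatable: map[string]CDIDevice{"gpu-0": "example.com/gpu=gpu-0"},
		prepared:    map[string][]CDIDevice{},
	}
	d := &driver{state: s}
	cdi, err := d.prepare("claim-1", []string{"gpu-0"})
	fmt.Println(cdi, err)
}
```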
Signed-off-by: Kevin Klues <[email protected]>
Force-pushed: 1e3c484 to 49c8d0a
```yaml
metadata:
  creationTimestamp: "2024-04-17T13:45:44Z"
  generateName: dra-example-driver-cluster-worker-gpu.resource.example.com-
  name: dra-example-driver-cluster-worker-gpu.resource.example.comxktph
```
Would it make sense to separate the random suffix from the name, e.g. `dra-example-driver-cluster-worker-gpu.resource.example.comxktph` -> `dra-example-driver-cluster-worker-gpu.resource.example.com.xktph`?
That is a question for @pohly -- this comes from his library for generating the name
The `generateName: dra-example-driver-cluster-worker-gpu.resource.example.com-` already has a trailing `-`. But as that prefix is long, Kubernetes truncates it and drops the trailing `-`.

One way to avoid that would be to truncate the `<node>-<driver>-` string in advance if it is too long, for example by cutting out some characters in the middle of each sub-component. Would that be better?
> this comes from his library for generating the name

Just to be clear: that library produces `generateName`, not the `name`; the `name` is generated by Kubernetes.
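The truncation described above can be sketched in a few lines. This is a simplified stand-in for the apiserver's name generator, assuming the usual constants (63-character budget, 5-character random suffix, so the base is cut to 58 characters); the `generateName` helper and its alphabet here are illustrative, not the real implementation.

```go
package main

import (
	"fmt"
	"math/rand"
)

const (
	maxNameLength = 63
	randomLength  = 5
	maxBaseLength = maxNameLength - randomLength // 58
)

// generateName truncates the base so base+suffix fits in the 63-char budget,
// then appends a random 5-character suffix (simplified stand-in).
func generateName(base string) string {
	if len(base) > maxBaseLength {
		base = base[:maxBaseLength] // this is where the trailing '-' gets dropped
	}
	suffix := make([]byte, randomLength)
	const alphabet = "abcdefghijklmnopqrstuvwxyz"
	for i := range suffix {
		suffix[i] = alphabet[rand.Intn(len(alphabet))]
	}
	return base + string(suffix)
}

func main() {
	base := "dra-example-driver-cluster-worker-gpu.resource.example.com-"
	fmt.Println(len(base)) // 59 characters: one over the 58-char budget
	fmt.Println(generateName(base))
}
```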
/lgtm
```go
AddFunc: func(obj any) {
	unstructured, ok := obj.(*unstructured.Unstructured)
	if !ok {
		klog.Errorf("Error converting object to *unstructured.Unstructured: %v", obj)
```

Suggested change:

```diff
 		klog.Errorf("Error converting object to *unstructured.Unstructured: %v", obj)
+		return
```

Otherwise `unstructured == nil` is used on line 109.
```go
UpdateFunc: func(oldObj any, newObj any) {
	unstructured, ok := newObj.(*unstructured.Unstructured)
	if !ok {
		klog.Errorf("Error converting object to *unstructured.Unstructured: %v", newObj)
```

Suggested change:

```diff
 		klog.Errorf("Error converting object to *unstructured.Unstructured: %v", newObj)
+		return
```
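The bug both suggestions fix is the same: after a failed type assertion the handler must return early, otherwise execution continues with a nil pointer. A standard-library-only sketch of the corrected pattern (where `Resource` is a hypothetical stand-in for `*unstructured.Unstructured`):

```go
package main

import "fmt"

// Resource is a hypothetical stand-in for *unstructured.Unstructured.
type Resource struct{ Name string }

// addFunc shows the corrected handler: log and return on a failed type
// assertion, so the nil result of the assertion is never dereferenced.
func addFunc(obj any) {
	res, ok := obj.(*Resource)
	if !ok {
		fmt.Printf("Error converting object to *Resource: %v\n", obj)
		return // without this, res == nil would be dereferenced below
	}
	fmt.Println("got resource:", res.Name)
}

func main() {
	addFunc(&Resource{Name: "gpu-claim"}) // handled normally
	addFunc("not-a-resource")             // logged and skipped, no nil deref
}
```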
```go
	return cdiDevices, nil
}

func (s *DeviceState) Unprepare(claimUID string) error {
	s.Lock()
	defer s.Unlock()

	if s.prepared[claimUID] == nil {
		checkpoint := newCheckpoint()
```
Why does the checkpoint need to be loaded every time `Prepare` or `Unprepare` is called? Is it some sort of protection against the kubelet-plugin Pod being offline / suspended temporarily? If not, my impression was that the checkpoint could be loaded just once, when the kubelet-plugin object is created and a new state object is constructed. Once the state is ready, the prepared claims would then be saved into the checkpoint every time `Prepare` or `Unprepare` is called, trusting the state object as the source of truth and ignoring old data in snapshots.
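The alternative described above can be sketched as: restore from the checkpoint exactly once in the constructor, keep the in-memory map as the source of truth, and only write the checkpoint back out on mutation. All names (`Checkpoint`, `NewDeviceState`, `save`) are hypothetical; persistence is elided.

```go
package main

import "fmt"

// Checkpoint is the persisted form of the prepared-claims map.
type Checkpoint struct {
	PreparedClaims map[string][]string // claim UID -> device names
}

type DeviceState struct {
	prepared map[string][]string
}

// NewDeviceState restores from the checkpoint exactly once, at construction.
func NewDeviceState(cp *Checkpoint) *DeviceState {
	s := &DeviceState{prepared: map[string][]string{}}
	for uid, devs := range cp.PreparedClaims {
		s.prepared[uid] = devs
	}
	return s
}

// Prepare mutates the in-memory map and writes it through to the
// checkpoint; it never re-reads the checkpoint file.
func (s *DeviceState) Prepare(claimUID string, devices []string) {
	s.prepared[claimUID] = devices
	s.save()
}

func (s *DeviceState) Unprepare(claimUID string) {
	delete(s.prepared, claimUID)
	s.save()
}

func (s *DeviceState) save() {
	// ... serialize s.prepared to the checkpoint file ...
}

func main() {
	s := NewDeviceState(&Checkpoint{PreparedClaims: map[string][]string{"claim-1": {"gpu-0"}}})
	s.Prepare("claim-2", []string{"gpu-1"})
	s.Unprepare("claim-1")
	fmt.Println(len(s.prepared)) // 1: only claim-2 remains
}
```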
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: byako, klueska The full list of commands accepted by this bot can be found here. The pull request process is described here
This PR adds support for structured parameters to the example DRA driver for Kubernetes v1.30 (removing support for "classic" DRA in the process). A new branch called `classic-dra` has been created to preserve support for it.

Closes #44